Building a song recommender

Fire up GraphLab Create


In [1]:
import graphlab

Load music data


In [2]:
song_data = graphlab.SFrame('song_data.gl/')


[INFO] GraphLab Server Version: 1.6.1

Explore data

The music data shows how many times each user listened to each song, along with details of the song.


In [3]:
song_data.head()


Out[3]:
user_id | song_id | listen_count | title | artist | song
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOAKIMP12A8C130995 | 1 | The Cove | Jack Johnson | The Cove - Jack Johnson
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBBMDR12A8C13253B | 2 | Entre Dos Aguas | Paco De Lucia | Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBXHDL12A81C204C0 | 1 | Stronger | Kanye West | Stronger - Kanye West
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBYHAJ12A6701BF1D | 1 | Constellations | Jack Johnson | Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODACBL12A8C13C273 | 1 | Learn To Fly | Foo Fighters | Learn To Fly - Foo Fighters ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODDNQT12A6D4F5F7E | 5 | Apuesta Por El Rock 'N' Roll ... | Héroes del Silencio | Apuesta Por El Rock 'N' Roll - Héroes del ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODXRTY12AB0180F3B | 1 | Paper Gangsta | Lady GaGa | Paper Gangsta - Lady GaGa
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFGUAY12AB017B0A8 | 1 | Stacked Actors | Foo Fighters | Stacked Actors - Foo Fighters ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFRQTD12A81C233C0 | 1 | Sehr kosmisch | Harmonia | Sehr kosmisch - Harmonia
b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOHQWYZ12A6D4FA701 | 1 | Heaven's gonna burn your eyes ... | Thievery Corporation feat. Emiliana Torrini ... | Heaven's gonna burn your eyes - Thievery ...
[10 rows x 6 columns]


In [4]:
graphlab.canvas.set_target('ipynb')

In [5]:
song_data['song'].show()



In [6]:
len(song_data)


Out[6]:
1116609

Count the number of unique users in the dataset


In [7]:
users = song_data['user_id'].unique()

In [8]:
len(users)


Out[8]:
66346

Q1: Find the artist with the most unique listeners


In [20]:
kanye_songs = song_data[song_data['artist'] == 'Kanye West']
foo_songs = song_data[song_data['artist'] == 'Foo Fighters']
swift_songs = song_data[song_data['artist'] == 'Taylor Swift']
gaga_songs = song_data[song_data['artist'] == 'Lady GaGa']

In [21]:
kanye_users = kanye_songs['user_id'].unique()
foo_users = foo_songs['user_id'].unique()
swift_users = swift_songs['user_id'].unique()
gaga_users = gaga_songs['user_id'].unique()

In [25]:
print("{} {} {} {}".format(len(kanye_users), len(foo_users), len(swift_users), len(gaga_users)))


2522 2055 3246 2928
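
The four filters above can be collapsed into one pass over the data. The following is a minimal pure-Python sketch of that idea, not GraphLab code: the toy `rows` list and the `unique_listeners` helper are hypothetical stand-ins for `song_data` and are not part of the original notebook.

```python
from collections import defaultdict

# Hypothetical toy rows of (user_id, artist); the real data lives in song_data.
rows = [
    ("u1", "Kanye West"),
    ("u1", "Kanye West"),    # a repeat listen by the same user counts once
    ("u2", "Kanye West"),
    ("u2", "Taylor Swift"),
    ("u3", "Taylor Swift"),
    ("u4", "Taylor Swift"),
    ("u1", "Lady GaGa"),
]


def unique_listeners(rows):
    """Map each artist to the number of distinct users who played them."""
    users_by_artist = defaultdict(set)
    for user_id, artist in rows:
        users_by_artist[artist].add(user_id)
    return {artist: len(users) for artist, users in users_by_artist.items()}


counts = unique_listeners(rows)
top_artist = max(counts, key=counts.get)
print(counts)
print(top_artist)
```

Collecting user ids in a set per artist is what makes repeat listens by the same user count only once.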

Q2: Find the artist with the most total listens (sum of listen_count)


In [29]:
listener_count = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')})

In [33]:
listener_count = listener_count.sort('total_count', ascending=False)
listener_count.head()


Out[33]:
artist | total_count
Kings Of Leon | 43218
Dwight Yoakam | 40619
Björk | 38889
Coldplay | 35362
Florence + The Machine | 33387
Justin Bieber | 29715
Alliance Ethnik | 26689
OneRepublic | 25754
Train | 25402
The Black Keys | 22184
[10 rows x 2 columns]
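
To make the aggregation concrete, here is a small pure-Python sketch of a group-by-and-sum. It is only an illustration of what the `groupby` call with `SUM('listen_count')` computes; the toy `rows` and the `total_listens` helper are hypothetical.

```python
from collections import Counter

# Hypothetical toy rows of (artist, listen_count).
rows = [
    ("Kings Of Leon", 3),
    ("Coldplay", 1),
    ("Kings Of Leon", 2),
    ("Train", 4),
    ("Coldplay", 2),
]


def total_listens(rows):
    """Sum listen_count per artist, like a group-by with a SUM aggregate."""
    totals = Counter()
    for artist, listen_count in rows:
        totals[artist] += listen_count
    return totals


totals = total_listens(rows)
# Sort descending by total; the last entry is then the least-played artist.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
most, least = ranked[0], ranked[-1]
print(most)
print(least)
```

Note that this sums plays rather than counting people: one user with many listens weighs as much as many users with one listen each.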

Q3: Find the artist with the fewest total listens


In [34]:
listener_count = listener_count.sort('total_count', ascending=True)
listener_count.head()


Out[34]:
artist | total_count
William Tabbert | 14
Reel Feelings | 24
Beyoncé feat. Bun B and Slim Thug ... | 26
Boggle Karaoke | 30
Diplo | 30
harvey summers | 31
Nâdiya | 36
Aneta Langerova | 38
Jody Bernal | 38
Kanye West / Talib Kweli / Q-Tip / Common / ... | 38
[10 rows x 2 columns]

Create a song recommender



In [9]:
train_data, test_data = song_data.random_split(0.8, seed=0)
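
The split sends roughly 80% of the rows to `train_data` and the rest to `test_data` (the training log reports 893580 of the 1116609 rows). As a rough illustration, a seeded row-level split can be sketched in plain Python as below. This is an assumption about the general technique, not GraphLab's actual implementation, and the standalone `random_split` function here is hypothetical.

```python
import random


def random_split(rows, fraction, seed=0):
    """Send each row to the first list with probability `fraction`.

    A fixed seed makes the split reproducible across runs.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))
train, test = random_split(rows, 0.8, seed=0)
print(len(train), len(test))
```

Because each row is assigned independently, the first part holds about 80% of the rows rather than exactly 80%.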

Simple popularity-based recommender


In [10]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')


PROGRESS: Recsys training: model = popularity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 0.763742s
PROGRESS: 893580 observations to process; with 9952 unique items.

Use the popularity model to make some predictions

A popularity model ranks songs by overall popularity and makes the same prediction for every user, so it provides no personalization.
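
A minimal sketch of the idea in plain Python, assuming popularity is simply the number of rows mentioning a song and that songs a user has already played are filtered out. Both are assumptions made for illustration; `fit_popularity` and `recommend_popular` are hypothetical helpers, not GraphLab functions.

```python
from collections import Counter

# Hypothetical toy interactions of (user_id, song).
train = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
]


def fit_popularity(interactions):
    """Count how many rows mention each song."""
    return Counter(song for _, song in interactions)


def recommend_popular(popularity, seen, k=2):
    """Return the k most popular songs that are not in `seen`."""
    ranked = [song for song, _ in popularity.most_common()]
    return [song for song in ranked if song not in seen][:k]


popularity = fit_popularity(train)
seen_by_user = {}
for user, song in train:
    seen_by_user.setdefault(user, set()).add(song)

recs_u1 = recommend_popular(popularity, seen_by_user["u1"])
recs_new = recommend_popular(popularity, set())
print(recs_u1)
print(recs_new)
```

The ranking itself is global; in this sketch the only per-user difference comes from dropping songs the user already knows.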


In [ ]:
popularity_model.recommend(users=[users[0]])

In [ ]:
popularity_model.recommend(users=[users[1]])

Build a song recommender with personalization

We now create a model that allows us to make personalized recommendations to each user.


In [ ]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')
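
To give a feel for what an item-similarity model does, here is a small pure-Python sketch. It assumes Jaccard similarity between the sets of users who played each song, and it scores a candidate song by its average similarity to the songs in the user's history. Both choices are assumptions for illustration and may not match GraphLab's internals; all names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy interactions of (user_id, song).
train = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "B"), ("u3", "C"),
    ("u4", "D"),
]

item_users = defaultdict(set)   # song -> users who played it
user_items = defaultdict(set)   # user -> songs they played
for user, item in train:
    item_users[item].add(user)
    user_items[user].add(item)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = len(a | b)
    return len(a & b) / float(union) if union else 0.0


def similar_items(item, k=2):
    """Rank the other songs by Jaccard similarity to `item`."""
    sims = [(other, jaccard(item_users[item], users))
            for other, users in item_users.items() if other != item]
    return sorted(sims, key=lambda kv: kv[1], reverse=True)[:k]


def recommend(user, k=2):
    """Score unseen songs by average similarity to the user's history."""
    history = user_items[user]
    scores = {}
    for candidate, users in item_users.items():
        if candidate in history:
            continue
        sims = [jaccard(users, item_users[h]) for h in history]
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


print(similar_items("A"))
print(recommend("u1"))
```

Because the score depends on each user's own listening history, two users with different histories get different rankings, which is what makes the model personalized.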

Applying the personalized model to make song recommendations

As you can see, different users get different recommendations now.


In [ ]:
personalized_model.recommend(users=[users[0]])

In [ ]:
personalized_model.recommend(users=[users[1]])

We can also use the model to find songs similar to any song in the dataset.


In [ ]:
personalized_model.get_similar_items(['With Or Without You - U2'])

In [ ]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])

Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves.
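
For reference, precision and recall at a cutoff k can be computed for one user as sketched below. This is a generic illustration of the metrics, not the code GraphLab runs; `precision_recall_at_k` and the toy lists are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    precision = hits / k, recall = hits / number of relevant items.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


recommended = ["A", "B", "C", "D", "E"]   # ranked model output
relevant = {"B", "E", "F"}                # songs the user played in held-out data

# One (precision, recall) point per cutoff traces out the curve.
curve = [precision_recall_at_k(recommended, relevant, k) for k in range(1, 6)]
for k, (p, r) in enumerate(curve, start=1):
    print(k, round(p, 3), round(r, 3))
```

Averaging these points over many test users, one point per cutoff, gives a precision-recall curve; a model whose curve lies above another's is doing better at both.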


In [ ]:
# Compare numeric (major, minor) parts; a plain string comparison would misorder e.g. "1.10" and "1.6".
major_minor = tuple(int(part) for part in graphlab.version.split('.')[:2])
if major_minor >= (1, 6):
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance, [popularity_model, personalized_model])
else:
    %matplotlib inline
    model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=0.05)

The precision-recall curve shows that the personalized model performs much better than the popularity model.

Q4: Find the song most often recommended to the first 10,000 test users


In [35]:
train_data, test_data = song_data.random_split(0.8, seed=0)

In [39]:
item_similarity_model = graphlab.item_similarity_recommender.create(train_data,
                                                             user_id='user_id',
                                                             item_id='song')


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 0.804962s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 9952 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 1.46503         |
PROGRESS: | 2000            | 1.49916         |
PROGRESS: | 3000            | 1.5328          |
PROGRESS: | 4000            | 1.56653         |
PROGRESS: | 5000            | 1.59924         |
PROGRESS: | 6000            | 1.63177         |
PROGRESS: | 7000            | 1.66651         |
PROGRESS: | 8000            | 1.70576         |
PROGRESS: | 9000            | 1.75355         |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 2.10749s

In [40]:
subset_test_users = test_data['user_id'].unique()[0:10000]

In [43]:
rec_songs = item_similarity_model.recommend(subset_test_users, k=1)


PROGRESS: recommendations finished on 1000/10000 queries. users per second: 1567.02
PROGRESS: recommendations finished on 2000/10000 queries. users per second: 1514.73
PROGRESS: recommendations finished on 3000/10000 queries. users per second: 1550.84
PROGRESS: recommendations finished on 4000/10000 queries. users per second: 1567.67
PROGRESS: recommendations finished on 5000/10000 queries. users per second: 1573.07
PROGRESS: recommendations finished on 6000/10000 queries. users per second: 1581.12
PROGRESS: recommendations finished on 7000/10000 queries. users per second: 1593.36
PROGRESS: recommendations finished on 8000/10000 queries. users per second: 1600.14
PROGRESS: recommendations finished on 9000/10000 queries. users per second: 1596.26
PROGRESS: recommendations finished on 10000/10000 queries. users per second: 1587.42

In [45]:
rec_songs.head()


Out[45]:
user_id | song | score | rank
c067c22072a17d33310d7223d7b79f819e48cf42 ... | Grind With Me (Explicit Version) - Pretty Ricky ... | 0.0459424433009 | 1
696787172dd3f5169dc94deef97e427cee86147d ... | Senza Una Donna (Without A Woman) - Zucchero / ... | 0.0170265780731 | 1
532e98155cbfd1e1a474a28ed96e59e50f7c5baf ... | Jive Talkin' (Album Version) - Bee Gees ... | 0.0118288659232 | 1
18325842a941bc58449ee71d659a08d1c1bd2383 ... | Goodnight And Goodbye - Jonas Brothers ... | 0.0168060865646 | 1
507433946f534f5d25ad1be302edb9a2376f503c ... | Find The Cost Of Freedom - Crosby_ Stills_ Nash & ... | 0.0165806601546 | 1
18fafad477f9d72ff86f7d0bd838a6573de0f64a ... | Rabbit Heart (Raise It Up) - Florence + The ... | 0.0799450902285 | 1
fe85b96ba1983219b296f6b4869dd29eb2b72ff9 ... | Secrets - OneRepublic | 0.079137043826 | 1
225ea420b4bede50919d1bfe24a599691522d176 ... | Alejandro - Lady GaGa | 0.0273359193626 | 1
95dc7e2b188b1148b2d25f4e6b6e94afacc4efc3 ... | Bust a Move - Infected Mushroom ... | 0.0534825732628 | 1
4a3a1ae2748f12f7ab921a47d6d79abf82e3e325 ... | Isis (Spam Remix) - Alaska Y Dinarama ... | 0.0418030208549 | 1
[10 rows x 4 columns]


In [47]:
song_count = rec_songs.groupby(key_columns='song', operations={'count': graphlab.aggregate.COUNT()})

In [49]:
song_count.sort('count', ascending=False)


Out[49]:
song | count
Undo - Björk | 432
Secrets - OneRepublic | 374
Revelry - Kings Of Leon | 233
You're The One - Dwight Yoakam ... | 165
Fireflies - Charttraxx Karaoke ... | 118
Hey_ Soul Sister - Train | 106
Horn Concerto No. 4 in E flat K495: II. Romance ... | 92
Sehr kosmisch - Harmonia | 85
OMG - Usher featuring will.i.am ... | 63
Clocks - Coldplay | 49
[3142 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

